Abstract

A National Institutes of Health (NIH) workshop was convened in Bethesda, MD on September 26-27, 2011, with
representative scientific leaders in the field of proteomics and its applications to clinical settings. The main purpose 
of this workshop was to articulate ways in which the biomedical research community can capitalize on recent 
technology advances and synergize with ongoing efforts to advance the field of human proteomics. This executive 
summary and the following full report describe the main discussions and outcomes of the workshop. 



Executive summary 

A National Institutes of Health (NIH) workshop was 
convened in Bethesda, MD on September 26-27, 2011, 
with representative scientific leaders in the field of pro- 
teomics and its applications to clinical settings. The main 
purpose of this workshop was to articulate ways in which 
the biomedical research community can capitalize on re- 
cent technology advances and synergize with ongoing 
efforts to advance the field of human proteomics. 

Proteins are the major components of biological net- 
works and molecular machines, and proteins are the tar- 
gets for the large majority of drugs available today. 
Participants in this Workshop recognized that a deeper 
knowledge of the human proteome could help fill the 
gap between genomes and phenotypes, transform the 
way we develop diagnostics and therapeutics, and 
thereby enhance overall biomedical research and future 
healthcare. The Human Genome Project and its many 
follow-on initiatives, including the HapMap and 
ENCODE, together with advances in protein sciences, 
have provided a foundation for proteomic technologies 
and informatics resources. Several major initiatives are 
already moving toward deep characterization of the
human proteome, including the antibody-based Human
Protein Atlas, the NIH Common Fund Protein Capture 
Reagents, the mass spectrometry-based PeptideAtlas
and Selected Reaction Monitoring (SRM) Atlas, and the 
Human Proteome Project organized by the Human 
Proteome Organization. Several leading laboratories 
have demonstrated that about 10,000 protein products, 
of the about 20,000 protein-coding human genes, can 
be identified and quantified in a single experimental 
specimen; this figure may represent nearly the complete 
complement of proteins actually expressed in a single 
cell type. In yeast the complete expressed proteome has 
been identified. Even though a more comprehensive 
characterization of the dynamic aspect of the proteome 
will require further technology development, it is a dis- 
ruptive concept that almost all of the primary products 
of the genome can be detected at the protein level in 
one single experiment. 

The Workshop was organized in five sessions: (1) pro- 
tein networks; (2) integrating proteomics with other 
omics; (3) quantitative proteomics by exploratory and 
targeted methodologies; (4) study design and statistical 
challenges in clinical proteomics; and (5) proteomic 
technologies in a clinical setting. Sessions 1-3 consti- 
tuted a main theme on systems biology; sessions 4-5 
represent a theme on strategies for clinical proteomics. 
The full agenda is at http://www3.niddk.nih.gov/fund/ 
other/HumanProteome2011. This executive summary
and the following full report describe the main discussions
and outcomes of the workshop.

Protein networks: toward a comprehensive wiring 
diagram of human cells 

The interactome network of cells is the complete set of 
macromolecular interactions that take place between 
genes and gene products; it is mostly mediated by pro- 
teins. Pull-down of diverse protein complexes and se- 
quencing of the components, plus other direct 
measurements of protein-protein, protein-nucleic acid, 
and protein-lipid interactions, now make it feasible to
create wiring diagrams for systems biology. The capabil- 
ity of quantifying the main gene products and providing 
information on post-translational modifications and 
splice variants of proteins addresses the dynamic nature 
of the networks. Experimental and natural perturbations 
of cultured cells and of whole organisms can then reveal 
connectivity and can test hypotheses of the blueprint 
and dynamic regulation of phenotypes. Informatics tools 
such as Cytoscape and databases such as BioGRID pro- 
vide means of visualizing pathways, networks, and 
interactomes. 

Integrating proteomics with other omics 

Detailed integration of data and knowledge from mul- 
tiple omics technology platforms is essential for building 
and understanding the pathways from genome to pheno- 
types and the influence of environmental and behavioral 
variables. The influence of allelic variants, splice variants, 
and post-translational modifications must be assessed in 
combined analyses of mRNA and protein abundance and 
response to perturbations. Single nucleotide polymorph- 
isms and alternative splicing can influence sites of post- 
translational modifications, magnifying their downstream 
effects. Linking gene expression, protein expression, and 
metabolomics has become an attractive approach, facili- 
tated by new bioinformatics tools. 

Quantitative proteomics by exploratory and targeted 
mass spectrometry methods 

There has been stunning progress in mass spectrom- 
etry-based proteomics, with new technologies and com- 
binations of instrumentation for bottom-up peptide 
analysis, top-down protein analysis, and targeted quanti- 
tative analysis of proteotypic peptides of selected pro- 
teins. The equipment and reagent sector for proteomics 
is a major economic engine. These advances enable 
much more potent approaches for biomarker develop- 
ment and protein targeting with therapeutic agents. In 
contrast to recent studies limited to the most abundant 
dozens or hundreds of proteins in a biological speci- 
men, current experiments are identifying, with a false 
discovery rate of one percent, and quantifying
approximately 10,000 proteins from different genes in
human cell lines, for example. Such deep analyses per- 
mit direct comparison with deep sequencing of the 
transcriptome, as well as with protein expression based 
on immunohistochemistry, as documented in the 
Human Protein Atlas. 

Study design and statistical challenges in clinical 
proteomics 

For the past decade, many of the individual institutes of 
the NIH have supported programs and projects to gen- 
erate potential protein biomarkers for early or more 
specific diagnosis or prognosis of a wide array of dis- 
eases. There are many complex challenges in developing 
such omics-based tests. Heterogeneity of etiology, 
pathogenesis, and responses to therapy among patients 
with identical diagnoses is common. Knowledge of
mechanisms, mediated by pathways and networks, is 
fundamental to moving beyond statistical correlation as 
a basis for biomarker development. Integration of data 
across multiple levels of omics analyses should facilitate 
such knowledge development. Several general recom- 
mendations for biomarker discovery projects were also 
made during the discussion. Participants emphasized 
the importance of specifying as early as possible in the 
process the intended clinical use, and the importance of 
proper study design in order to avoid introducing bias. 
Finally, it is important for translational scientists to 
understand the long path of discovery, confirmation, 
validation, clinical trials, and FDA approval to establish 
test validity and utility and gain reimbursement of the 
laboratory service. 

Proteomic technologies in a clinical setting 

The late-stage translation of proteomic technologies and 
protein biomarker candidates into clinical tests requires 
specification of the intended clinical use, sufficient evi- 
dence in preliminary studies to support the investment 
for a large-scale validation trial, demonstration of reliable 
test performance characteristics, and sufficient clinical 
benefit to gain acceptance by the clinical community. It 
is counterproductive to try to short-circuit these com- 
plex steps, especially in the absence of a strong biological 
foundation for the biomarker candidate or panel of bio- 
markers. The development of a roadmap for the transla- 
tion of proteomics technologies into clinical settings will 
require close collaboration between researchers, indus- 
tries, regulators, clinical chemists and clinicians, includ- 
ing private-public partnerships to leverage existing NIH 
programs. Such a partnership would accelerate the devel- 
opment of clinically useful technologies and biomarkers 
and help meet unmet clinical needs in personalized
health care for patients.

Conclusions

Human proteome networks in health and disease - A 
major scientific opportunity 

A session within the Workshop was dedicated to discus- 
sion of potential scientific opportunities and identification 
of the most compelling ideas for future developments in 
the field of proteomics. Several potential concepts were 
considered in the context of the presentations that were 
made by the speakers. The participants reached consensus 
that a special opportunity exists at this time for utilizing 
modern proteomics to link data from multiple levels of 
omics technologies and build wiring diagrams of human 
cells and tissues through interactome networks and related 
phenomena. The focus on protein interactome networks 
would be salient for all disease processes and would yield a 
stronger foundation for the many NIH institute-specific 
programs seeking a more effective translation to biomarker 
development. 

Full report of the NIH workshop on the human 
proteome 

A National Institutes of Health (NIH) workshop was 
convened in Bethesda, MD on September 26-27, 2011,
with representative scientific leaders in the field of pro- 
teomics and its applications to clinical settings. The main 
purpose of this workshop was to articulate ways in which 
the biomedical research community can capitalize on re- 
cent technology advances and synergize with ongoing 
efforts to advance the field of human proteomics. 

Proteins are the major components of biological net- 
works and molecular machines, and proteins are the tar- 
gets for the large majority of drugs available today. 
Participants in this Workshop recognized that a deeper 
knowledge of the human proteome could help fill the 
gap between genomes and phenotypes, transform the 
way we develop diagnostics and therapeutics, and 
thereby enhance overall biomedical research and future 
healthcare. The Human Genome Project and its many 
follow-on initiatives, including the HapMap and 
ENCODE, together with advances in protein sciences, 
have provided a foundation for proteomic technologies 
and informatics resources. Several major initiatives are 
already moving toward deep characterization of the 
human proteome, including the antibody-based Human 
Protein Atlas, the mass spectrometry-based PeptideAtlas
and Selected Reaction Monitoring (SRM) Atlas, and the 
Human Proteome Project organized by the Human 
Proteome Organization [1-4]. Leading laboratories have 
demonstrated that protein products of up to ~10,000 of
the ~20,000 protein-coding human genes can be identified
and quantified in a single experimental specimen
[5-7]; this figure may represent nearly the complete 
complement of proteins actually expressed in a single 
cell type. In yeast the complete set of expressed proteins
has been identified. It is a disruptive concept that the
proteome can now be analyzed comprehensively and 
that all of the primary protein products of the genome 
can be detected. 

The Workshop was organized in five sessions: (1) pro- 
tein networks; (2) integrating proteomics with other 
omics; (3) quantitative proteomics by exploratory and tar- 
geted methodologies; (4) study design and statistical chal- 
lenges in clinical proteomics; and (5) proteomic 
technologies in a clinical setting. Sessions 1-3 constituted 
a main theme on systems biology; sessions 4-5 represent 
a theme on strategies for clinical proteomics. The full 
agenda is at http://www3.niddk.nih.gov/fund/other/ 
HumanProteome2011. The following report describes the 
main discussions and outcomes of the workshop. 

Protein networks: Toward a comprehensive wiring 
diagram of human cells (Marc Vidal, Dana-Farber Cancer 
Institute, chair) 

Understanding the detailed mechanistic paths between 
genotypes and phenotypes is one of the most important 
goals of biology, and critical in the quest for new thera- 
peutics. Complex genotype-to-phenotype relationships 
exist in common disorders and traits, but even Mendel- 
ian disorders are complicated by phenomena such as in- 
complete penetrance, variable expressivity, pleiotropy 
and modifier genes. Recent advances in genome biology 
have revealed extremely complex links between geno- 
typic modifications and phenotypic changes in cancers, 
for example. No single phenotype will be fully explained 
by simple changes in any single gene, because gene/en- 
vironment interactions and perturbations of biological 
systems and cellular networks, not single proteins, 
underlie genotype-phenotype relationships [8]. Vidal 
defined the "interactome network" of cells as the 
complete set of macromolecular interactions that can 
take place between genes and gene products, including 
protein-protein, protein-DNA, protein-RNA, DNA-DNA, 
RNA-RNA, enzyme-substrate and post-translational 
modification interactions. Maps of such macromolecular 
interactions generated at the scale of the whole proteome 
will be necessary, although still not sufficient, to fully 
understand biological complexity. He characterized the 
current state-of-the-art on interactome networks as 
similar to the exciting time of the late 1990s for the 
Human Genome Project. He concluded that the com- 
munity is ready to produce a "systematic, unbiased, 
freely available wiring diagram for systems biology" on 
which to add logical and dynamic relationships. Proteo- 
mics is needed to inform those relationships. 
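
As an illustration only, the interactome network defined above can be pictured as a graph whose nodes are genes or gene products and whose edges carry one of the interaction types listed. The sketch below is a minimal data structure under that assumption; the class name, method names, and molecule names are hypothetical placeholders and are not drawn from any dataset or software discussed at the workshop.

```python
from collections import defaultdict

# Interaction types named in the text; the molecules used below are placeholders.
INTERACTION_TYPES = {
    "protein-protein", "protein-DNA", "protein-RNA",
    "DNA-DNA", "RNA-RNA", "enzyme-substrate",
    "post-translational modification",
}


class Interactome:
    """Undirected graph: nodes are genes or gene products, edges carry a type."""

    def __init__(self):
        # node -> set of (neighbor, interaction_type) pairs
        self._edges = defaultdict(set)

    def add_interaction(self, a, b, kind):
        if kind not in INTERACTION_TYPES:
            raise ValueError(f"unknown interaction type: {kind}")
        self._edges[a].add((b, kind))
        self._edges[b].add((a, kind))

    def partners(self, node, kind=None):
        """Sorted partners of a node, optionally restricted to one type."""
        found = {n for n, k in self._edges.get(node, ()) if kind in (None, k)}
        return sorted(found)

    def degree(self, node):
        return len(self.partners(node))


net = Interactome()
net.add_interaction("KinaseA", "SubstrateB", "enzyme-substrate")
net.add_interaction("KinaseA", "ScaffoldC", "protein-protein")
net.add_interaction("FactorD", "geneE_promoter", "protein-DNA")

print(net.partners("KinaseA"))
print(net.partners("KinaseA", kind="protein-protein"))
```

A static structure of this kind corresponds only to the "wiring diagram"; the logical and dynamic relationships mentioned above would need quantitative attributes on the nodes and edges.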

Suzanne Gaudet, Dana-Farber Cancer Institute, pre- 
sented "Predicting phenotypes: quantities and dynamics 
in proteomics". Because of complex systems properties 
that underlie most biological processes, identical
genotypes can give rise to very different phenotypic outcomes.
In her example, HeLa cells, when treated with
the apoptosis-inducing ligand TRAIL, produce two strik- 
ingly different populations of cells: 80% die and the 
remaining 20% survive [9]. Cells that survive this treat- 
ment are capable of generating the same bi-modal re- 
sponse when treated a second time. She demonstrated 
how this variability is mediated biochemically; computer- 
generated, model-based simulations were able to recap- 
itulate this behavior. In the case of apoptosis, appropriate 
measurements for a dozen proteins accurately predict the 
phenotypic outcome of such cell perturbations [10]. Bio- 
logical experiments over many years and from many la- 
boratories have led to a predictive wiring diagram of this 
specific biological process. For more general proteome- 
scale wiring diagram maps, three major components are 
needed: i) network information on binding partners and 
biochemical reactions; ii) quantitative information on 
protein levels, protein affinities and reaction rates; and iii) 
biosensors to measure response dynamics in situ. 

Kara Dolinski, Princeton University, presented "System- 
atic knowledge capture and representation: the Biological 
General Repository for Interaction Datasets (BioGRID)" 
[11]. She and her colleagues are tackling the challenge of 
generating cellular wiring diagrams by collecting and cur- 
ating published information on macromolecular interac- 
tions from both small-scale bottom-up approaches and 
large-scale proteome-wide mapping enterprises. She presented
download numbers and traffic volume for the
BioGRID website, which so far summarizes information for
yeast and for human biology. Dolinski highlighted exciting
prospects for visualization of interactome networks
derived from curating literature to generate Bayesian models
of disease-specific networks. BioGRID and other databases
have become critical for further development of
interactome network model-based systems biology.
Resources such as ProteomeXchange, which has emerged as
a point of connection for the mass spectrometric proteomic
data repositories, should play a key role in the
development of a comprehensive wiring diagram of human
cells. However, a broad strategy for sustained support of 
key data resources is needed. 

Integrating proteomics with other omics (Mark Gerstein, 
Yale University, chair) 

Given the scale of other datasets, particularly those 
derived from next-generation sequencing, much added 
value is achieved from integrating proteomics datasets 
with other data. Four key themes for such data integra- 
tion were identified. 

a) "Direct Integration of mRNA Gene Expression and 
Protein Abundance Datasets". Gerstein described two 
forms: first, a simplified context for the past decade
comparing levels of mRNAs and their corresponding
proteins and changes in those levels after a 
perturbation in a time-course experiment or with 
successive measurements of clinical specimens; and, 
second, a more elaborate future context in relation 
to allelic expression, comparing maternal and 
paternal alleles both for gene expression and protein 
abundance using the exact sequences that come 
from mass spectrometry or transcriptome 
sequencing [12-14]. The "future case" allows for the 
examination in detail of the effects of specific 
mutations on gene expression, using the maternal 
and paternal alleles as perfectly matched controls. 
For quantitative proteomics, both for the simpler 
case of comparing molecular concentrations and the 
allelic case, one would need the protein abundance 
sets, preferably including post-translational 
modifications and splice variants, precisely matched 
against RNA-Seq sets. 

b) "Connecting Proteomics Data to the Huge Amount of 
Variation Data". Joel Bader, Johns Hopkins 
University, addressed the idea of connecting 
proteome data, particularly in the form of networks, 
with the huge amount of variation data coming from 
personal genomic sequencing. Other participants 
also emphasized the importance of connecting the 
complex aspects of proteins to the variation data. In 
particular, a single nucleotide polymorphism (SNP) 
can differentially affect different transcripts from the 
same gene. Moreover, SNPs potentially can have 
stronger effects than one might imagine by hitting a 
splice site or a site of post-translational modification 
in proteins; e.g. a splice SNP could result in removal 
of an entire exon. These features could be addressed 
by developing large datasets of protein isoforms and 
linking these against gene annotations [15,16]. 

c) "Multi-dimensional Data Integration". All speakers 
and participants emphasized that proteomics data 
should be integrated with diverse biomedical 
information. Robert Gerszten, Massachusetts 
General Hospital, discussed the importance of 
connecting proteomics data with clinical 
measurements and metabolomics, and Gerstein 
emphasized the usefulness of connecting the protein 
networks with three-dimensional structures of 
proteins and protein complexes. This integration 
opportunity could be further pursued by solving co- 
crystal structures of proteins and using these to 
provide molecular details for interaction networks. 
Gil Omenn, University of Michigan, extended this 
comment by citing current work using I-TASSER 
algorithms to predict three-dimensional structures 
and conformations of pairs of splice variant proteins 
differentially expressed in Her2/neu breast cancer
and infer the functional consequences of the
sequence differences between the splice variants 
[17]. 

d) "The Complexities and Subtleties of Detailed 
Integration". The challenges in achieving data 
integration in the framework of a working database 
system were underscored by Rolf Apweiler, European 
Bioinformatics Institute [18]. He pointed out that, in 
many instances, while one can get most of the 
integration done, there are some unresolved cases, 
including such major aims as connecting the 
genomics data from Ensembl to the proteomics 
information in UniProt. Zhiping Weng, University of
Massachusetts Medical School, highlighted how 
chromatin marks might be used to predict gene 
expression information in the framework of an 
integrative model. This approach represents moving 
beyond simply putting together the datasets to 
actually exploring how one dataset might be used to 
predict another. 
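
The simplest form of integration described in theme a), comparing mRNA levels with the abundance of the corresponding proteins, can be sketched as a rank correlation over genes measured on both platforms. The example below is a minimal illustration under stated assumptions: the gene identifiers and values are invented, and the choice of a Spearman rank correlation is ours and was not specified by the speakers.

```python
from math import sqrt


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical matched measurements keyed by gene identifier.
mrna = {"geneA": 120.0, "geneB": 15.0, "geneC": 300.0, "geneD": 42.0, "geneE": 8.0}
protein = {"geneA": 5.1e4, "geneB": 9.0e3, "geneC": 2.2e5, "geneD": 4.0e3, "geneE": 1.1e3}

# Only genes present in both datasets can be compared.
shared = sorted(mrna.keys() & protein.keys())
rho = spearman([mrna[g] for g in shared], [protein[g] for g in shared])
print(f"Spearman rho over {len(shared)} genes: {rho:.2f}")
```

The allelic "future case" would require the same comparison per allele, which in turn depends on sequence-level matching that this sketch does not attempt.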

Quantitative proteomics by exploratory and targeted 
mass spectrometry methodologies (Matthias Mann, Max 
Planck Institute of Biochemistry, chair) 

The four speakers (Matthias Mann; Joshua Coon, Univer- 
sity of Wisconsin-Madison; Robert Moritz, Institute for 
Systems Biology; Forest White, Massachusetts Institute of 
Technology) addressed a mixture of technological, data- 
centric and functional biological proposals and issues. 
They emphasized the stunning technological progress in 
mass spectrometry-based proteomics [5-7,19-21]. Large- 
scale analyses corresponding to in-depth microarray and 
RNA-Seq methods for gene expression are now feasible. 
Since proteins are the workhorses of the cell, the capabilities
of MS-based proteomics are crucially important to
obtain a balanced view of the cell and to put genetic and 
genomic findings in biological context. 

"Expression Proteomics". Mann and Coon described re- 
cent and emerging innovations in shotgun proteomics. 
In contrast to studies limited to the most abundant doz- 
ens or hundreds of proteins in a specimen, the proteome 
can now readily be analyzed in great depth and with high 
quantitative accuracy. In yeast, nearly complete coverage 
of the expressed genes has already been achieved. Mann 
presented data obtained using the Orbitrap-Elite plat- 
form where more than 10,000 proteins were identified in 
a single lysate from a human HeLa cancer cell line [7] 
and from 10 other human cell lines, representing a broad 
range of organs of origin. Ruedi Aebersold's laboratory 
has published similar deep quantitation of U2OS osteosarcoma
cells [6]. Comparison to deep sequencing
(RNA-seq) data suggests that the majority of the func- 
tionally active proteome can already be quantified with 
such new technology [5-7]. However, to extend these
capabilities to protein isoforms and to make them accessible
to more laboratories, vigorous technology development
should be pursued, using combinations of
mass spectrometry methods and bioinformatics tools 
that detect post-translational modifications. Mann also 
presented a framework for intelligent data acquisition 
and real-time database searching using MaxQuant-Real 
Time, permitting searches according to specific GO 
terms (kinase activities were given as an example) and
finding specific kinds of peptide modifications
[22]. This capability is necessary for obtaining
greater sequence coverage of the individual proteins, 
which would be helpful in distinguishing and mapping 
protein isoforms. Coon also described a smart data ac- 
quisition strategy in which the mass spectrometer is 
directed to identify specific peptides. Such strategies 
help to increase proteome coverage in shotgun experi- 
ments because they allow including important peptides 
for sequencing and excluding irrelevant ones. For pro- 
teins up to 50,000 molecular weight, top-down meth- 
ods are becoming very useful [21]. Using this approach 
3,000 different molecular species, representing about 
1,000 main gene products, can be fully characterized 
by mass spectrometry in a single project. The applica- 
tion and further development of these approaches 
should be encouraged. 

"Targeted proteomics with selected reaction monitoring 
(SRM)". Robert Moritz described the SRM Atlas, a 
whole-proteome initiative jointly led by the Aebersold laboratory
at ETH-Zurich and the Moritz laboratory at the
Institute for Systems Biology in Seattle using the Triple- 
Quad mass spectrometers rapidly emerging from mul- 
tiple manufacturers. In Selected Reaction Monitoring, 
several proteotypic peptides distinctive for each targeted 
protein are chosen for their expected transition proper- 
ties in the mass spectrometer and then identified and 
quantified using corresponding heavy-labeled spiked-in 
peptides. Within just 2-3 years, multiple peptides for 
each of the expressed yeast genes and now 99% of the 
20,300 human gene-coded proteins have been prepared; 
their mass spectra have been determined and shared 
publicly through the SRM Atlas. Proteins that are bio- 
marker candidates from discovery phase research can be 
assayed with SRM peptides to facilitate experimental and 
clinical studies across a wide array of diseases. These 
peptide and spectra resources are valuable assets for the 
entire proteomics and life sciences research communi- 
ties. This database of SRM peptide transitions can also be 
used as a reference to interpret experiments in which all 
peptides in a particular mass range are fragmented together. 
Measuring many of these proteins in complex specimens 
like tissue lysates or plasma will require further increases in 
sensitivity, using either anti-peptide antibodies [23] or 
enhanced mass spectrometry. 
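
The quantification step described above, in which an endogenous peptide is measured against a heavy-labeled spiked-in counterpart, reduces to a ratio calculation. The following is a minimal sketch under our own assumptions: the function name, the use of the median ratio across transitions, and all peak areas are hypothetical and are not taken from the SRM Atlas or any instrument software.

```python
from statistics import median


def quantify_peptide(light_areas, heavy_areas, heavy_spike_fmol):
    """Estimate the endogenous peptide amount from paired transition areas.

    light_areas and heavy_areas hold peak areas for the same transitions,
    measured for the endogenous (light) and spiked-in (heavy) peptide.
    The median light/heavy ratio is scaled by the known spiked amount.
    """
    if len(light_areas) != len(heavy_areas) or not light_areas:
        raise ValueError("need matching, non-empty transition lists")
    ratios = [light / heavy for light, heavy in zip(light_areas, heavy_areas)]
    return median(ratios) * heavy_spike_fmol


# Hypothetical peak areas for three transitions of one proteotypic peptide.
light = [2.0e5, 1.5e5, 0.9e5]
heavy = [4.0e5, 3.0e5, 2.0e5]
estimate = quantify_peptide(light, heavy, heavy_spike_fmol=50.0)
print(f"{estimate:.1f} fmol")
```

Taking the median across transitions is one way to limit the influence of a single interfered transition; other summaries are equally plausible.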

"Post-translational modifications". Besides deep quantitative
and targeted expression proteomics, an area of
great promise is the large-scale identification, quantifica- 
tion, and mapping of post-translational modifications. 
Tens of thousands of phosphorylation, ubiquitinylation, 
acetylation, and glycosylation sites have been uncovered 
by mass spectrometry. Forest White pointed out that we
should now focus on the biological functions of these 
modifications, truly a grand challenge. Particular direc- 
tions can include the mapping of kinase/substrate rela- 
tionships using modified kinases, starting with specific 
pathways of particular interest in oncology. The same 
approaches will be applied to the entire array of protein 
classes in the long term. White emphasized the need for 
basic biochemical data, such as Kd values, in order to
better understand and model biological processes. 

"Protein Interactions". The area of protein interactions 
was discussed by many participants. Efficient approaches 
and protocols now exist for mapping interactions of full- 
length proteins. With quantitative proteomics, specific 
binders can be distinguished from background binders. 
Importantly, the specific interactions of modified pep- 
tides, DNA, RNA, and small molecules with their target 
proteins can now be addressed in a large-scale format. 
These represent important areas of biology and biotech- 
nology where few alternative techniques exist. 

The effect of individual genetic differences on the 
proteome (or lack thereof) has been the subject of a few 
pioneering studies but is still largely unexplored [24-26]. 
MS-based proteomics is uniquely positioned to measure
the effects of these differences at the level where it 
counts, namely the level of protein expression or activity 
differences. This work will be essential to translate gen- 
etic differences to differences in pathways and differ- 
ences in how those pathways should be modulated by 
drugs or other means. 

Study design and statistical challenges in clinical 
proteomics (Gilbert S. Omenn, University of Michigan, 
chair) 

Omenn opened this session with comments about chal- 
lenges in developing omics-based tests for cancers and 
other diseases. He emphasized that specification of the 
intended clinical use is the critical first step. Heterogeneity 
of etiology, pathogenesis, and responses to therapy among 
patients with identical diagnoses and heterogeneity within 
tumor masses provide major challenges for developing 
tests aimed at the clinical needs of diagnosis, prognosis, 
and guided therapy. Knowledge of mechanisms can en- 
hance test development by providing a biological founda- 
tion for the test, rather than relying on statistical 
correlations. Integration of data from complementary gene 
expression, genomic, epigenomic, proteomic, and metabolomic
platforms will enhance these complex studies.
Finally, it should be acknowledged that it is a long path of
discovery, confirmation, validation, clinical trials, and FDA 
approval to establish test validity and utility and gain reim- 
bursement of the laboratory service. 
Several major statistical challenges were identified: 

a) High-dimensional data with relatively few specimens 
tested in the discovery phase inevitably lead to high risks 
of over-fitting; an extreme case is two pooled specimens. 

b) Multi-site collection of specimens, with pre-analytical
and analytical variation, generates prominent "lab effects"
or "batch effects", which can overwhelm the disease
associations; however, Nathan Price of the Institute for
Systems Biology has emphasized the value of analyzing
multiple laboratories' results to estimate variance and
find a common biomarker signature.

c) There are always tradeoffs between sensitivity and
specificity of test results, which correspond to type 2 and
type 1 errors (false-negatives and false-positives),
respectively; any claim of 100% sensitivity and 100%
specificity should be viewed with maximal skepticism. A
better parameter for a screening test is the positive
predictive value (PPV), which takes account of the
intended clinical use and the prevalence of true positives
in the population to be tested.

d) Bias can be introduced in multiple ways (see below); for
example, use of several equivalent methods, with selective
reporting of the method that happens, perhaps randomly,
to give the most favorable results.

e) False-positive rates in matching peptide sequences from
mass spectrometry with protein databases are estimated in
variable ways; PeptideAtlas recommends a rigorous cutoff
at 1% FDR (0.01) at the protein level, which generally
corresponds to 0.16% (0.0016) at the peptide level [27].
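
To make the point in item c) concrete, the positive predictive value can be computed from sensitivity, specificity, and the proportion of true positives in the tested population. The sketch below uses the standard definition PPV = TP / (TP + FP); the test characteristics and the two population proportions are hypothetical values chosen only to show how strongly PPV depends on the population.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = TP / (TP + FP), with all quantities expressed per person tested."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# The same hypothetical test (90% sensitivity, 95% specificity)
# applied to a high-risk population and to a general screening population.
for prevalence in (0.20, 0.001):
    ppv = positive_predictive_value(0.90, 0.95, prevalence)
    print(f"prevalence {prevalence:.3f}: PPV = {ppv:.3f}")
```

With these illustrative numbers, most positive results in the low-prevalence population are false positives even though sensitivity and specificity are unchanged.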

Omenn concluded that new statistical methods and 
conventions are needed to enhance the integrated ana- 
lysis of omics results from multiple platforms. Even if 
the data are collected from specimens on the same indi- 
vidual, compounding of errors and biases is likely. Bio- 
logical knowledge of meaningful, testable pathways and 
networks should help in reducing biases. 

Steven Skates, Massachusetts General Hospital, pre- 
sented "Study Design in Omics Biomarker Research". He 
discussed clinically-derived quantitative goals and 
sources of bias that are threats to the validity of omics- 
based biomarkers. It is common to characterize tests by 
sensitivity (proportion of true cases that test positive) and
specificity (proportion of non-cases that test negative). It is necessary
to optimize the combination in light of prevalence of the 
condition to be detected and clinical and ethical import- 
ance of missing the diagnosis (false-negatives) or making 
a false diagnosis (false-positives). A judgment about ben- 
efits of true positive and true negative and harms of false 
positive and false negative results is needed [28]. Perhaps 
this can be done by stating a minimum benefits/harms 
ratio, with input from clinicians. Skates' example of early
detection of ovarian cancers (prevalence 1 in 2500 postmenopausal
women) showed the impact of introducing a
confirmatory test, like ultrasound, after the molecular 
screening test. How many patients would have to 
undergo testing and then, of those testing positive, sur- 
gery to find one case of ovarian cancer? What would be 
an acceptable ratio? For example, to achieve a ratio of 
five surgeries to one patient with ovarian cancer would 
require test specificity of 98% (2% false positives = 50/2500) 
plus the 10-fold benefit of ultrasound. He then 
outlined the stories of OvaCheck and OvaSure as tests 
that failed due to bias. The intended use for the FDA-cleared 
OVA1 was much narrower, "to assess the likelihood 
of malignancy in patients with ovarian adnexal 
mass when surgery is planned and not yet referred to an 
oncologist". He concluded that investigators should avoid 
biased early studies, which set us off in wrong directions, 
as it would be more efficient and more scientifically 
sound to seek high-quality, clearly-unbiased specimens 
for early stage studies. 
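
As a rough illustration of the arithmetic in Skates' ovarian cancer example, the following Python sketch computes the expected number of false-positive surgeries per true case and the corresponding PPV. The function names are hypothetical, and two simplifications are assumptions not stated at the workshop: the screening test is taken to have 100% sensitivity, and the confirmatory ultrasound is assumed to reduce false positives 10-fold without missing any true cases.

```python
def false_positives_per_case(prevalence, sensitivity, specificity, fp_reduction=1.0):
    """Expected false-positive surgeries for each true case detected.

    fp_reduction is the assumed fold-reduction in false positives from a
    confirmatory test (1.0 means no confirmatory test is used).
    """
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity) / fp_reduction
    return false_pos / true_pos


def ppv(prevalence, sensitivity, specificity, fp_reduction=1.0):
    """Positive predictive value: true positives / all positives sent to surgery."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity) / fp_reduction
    return true_pos / (true_pos + false_pos)


PREVALENCE = 1 / 2500   # from the text: 1 in 2500 postmenopausal women
SENSITIVITY = 1.0       # assumption for illustration only
SPECIFICITY = 0.98      # from the text

screen_only = false_positives_per_case(PREVALENCE, SENSITIVITY, SPECIFICITY)
with_ultrasound = false_positives_per_case(
    PREVALENCE, SENSITIVITY, SPECIFICITY, fp_reduction=10.0
)
print(f"screen only:     {screen_only:.1f} false positives per case")
print(f"with ultrasound: {with_ultrasound:.1f} false positives per case")
print(f"PPV with ultrasound: {ppv(PREVALENCE, SENSITIVITY, SPECIFICITY, 10.0):.3f}")
```

Under these assumptions, 98% specificity alone yields roughly 50 false positives per true case, and the 10-fold confirmatory step brings this to roughly 5, consistent with the five-to-one ratio cited above.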

Lisa McShane, National Cancer Institute, presented 
"Statistical Issues in the Development of Reliable and 
Clinically Relevant Prognostic and Predictive Proteomic 
Signatures". She discussed practical methods to 
operationalize classifiers, risk scores, or decision trees as 
mathematical models for molecular markers, whether 
RNA, DNA, or proteins [28,29]. It is critical to define 
the intended use. The categories include early detection 
or risk estimation before there is a clinical diagnosis; 
confirmation, staging, and subtyping 
upon diagnosis; prognosis or prediction before the 
start of therapy; desired responses and potential toxicity 
from therapy; and post-treatment outcomes, including 
survival and absence or recurrence of disease. Predictive 
signatures refer to treatment effect modifiers. Prognostic 
effects are typically quantified by hazard ratio, while pre- 
dictive effects are typically quantified by ratio of sub- 
group-specific treatment hazard ratios. The goal is to 
create and validate a clinical test from molecular data, as 
has been done with the 21-gene recurrence risk score 
(OncotypeDX) and the Mammaprint 70-gene signature, 
which are used clinically to identify women with such 
low risk of metastasis that adjuvant chemotherapy can be 
considered unnecessary. She demonstrated how to an- 
swer such questions as: Is the prognostic information 
sufficiently strong to influence clinical decisions? Does 
the predictor provide information beyond standard prog- 
nostic factors (i.e., "added value")? Proper control groups 
are critical for interpreting results with and without use 
of the marker. Data from randomized clinical trials can 
distinguish benefit of therapy only for marker-positive 
participants from benefit for all participants (in which 
case the marker test may not add value). The process 
requires multiple steps, including lock-down of the assay 
and classifier, then internal validation on suitable specimens, 
and then external validation on independent set(s) 
of specimens/data. Expert statistical steps involve feature 
selection and supervised dimension/data reduction. 
Over-fitting is a particularly devastating problem, which 
is predictable; when there are too many parameters rela- 
tive to the number of specimens or patients, the model 
will describe random error or noise instead of an under- 
lying relationship. Leave-one-out and other cross-valid- 
ation methods also must be done expertly. Statistical 
maneuvers cannot overcome built-in biases from variable 
specimen handling and other lab or batch effects. 
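
The over-fitting and cross-validation points above can be illustrated with a deliberately small, deterministic Python sketch. It is not a method presented at the workshop: the data, the single-marker "classifier", and all function names are hypothetical. Six specimens are paired with 64 candidate binary markers (every possible pattern), so one marker matches the labels perfectly by construction, standing in for the case of many more features than specimens. The sketch compares leave-one-out accuracy when the marker is selected once on all specimens with the accuracy when selection is repeated inside each fold.

```python
from itertools import product


def agreement(feature, labels, indices):
    """Count specimens (restricted to `indices`) where the marker equals the label."""
    return sum(feature[i] == labels[i] for i in indices)


def select_feature(features, labels, indices):
    """Pick the single marker that best matches the labels on `indices`.

    Ties go to the first marker in list order (the behaviour of max()).
    """
    return max(features, key=lambda f: agreement(f, labels, indices))


def loo_accuracy(features, labels, reselect_each_fold):
    """Leave-one-out accuracy of a classifier that reports the chosen marker's value."""
    n = len(labels)
    everyone = range(n)
    # Biased variant: the marker is chosen once using every specimen,
    # including the ones that are later "held out".
    fixed = select_feature(features, labels, everyone)
    correct = 0
    for held_out in everyone:
        if reselect_each_fold:
            train = [i for i in everyone if i != held_out]
            chosen = select_feature(features, labels, train)
        else:
            chosen = fixed
        correct += chosen[held_out] == labels[held_out]
    return correct / n


labels = [0, 1, 1, 0, 1, 0]                            # hypothetical class labels
features = list(product([0, 1], repeat=len(labels)))   # 64 candidate markers

biased = loo_accuracy(features, labels, reselect_each_fold=False)
honest = loo_accuracy(features, labels, reselect_each_fold=True)
print(biased, honest)  # prints "1.0 0.5"
```

Selecting the marker on all specimens makes the held-out test meaningless (apparent accuracy 1.0). When selection is repeated within each fold, the held-out value is unconstrained by the training specimens, and accuracy falls to chance (0.5 here, given the tie-breaking rule and balanced labels).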

The speakers reinforced the importance of identifying 
pathways and networks that make biomarker candidates 
biologically meaningful and credible. There is a special 
opportunity to build on NIH Common Fund programs 
such as the Technology Centers for Networks and 
Pathways and to help frame the Biology- and 
Disease-driven components of the global Human Proteome Project 
(B/D-HPP) being launched by the Human Proteome 
Organization [3]. This would effectively leverage the tremendous 
investments around the world that have 
already been made in the mass spectrometry, protein-capture 
reagent, and knowledgebase pillars of the HPP, 
and in the HPP chromosome-centric program. 

Proteomic technologies in a clinical setting (Daniel W. 
Chan, Johns Hopkins University School of Medicine, chair) 

Chan discussed the translation of proteomic technologies 
into a clinical setting [30-32]. He presented multiple rea- 
sons for the significant gap between biomarker discovery, 
validation and translation. The development of OVA1, 
the first proteomic IVDMIA (in vitro diagnostic multi- 
variate index assay) cleared by the FDA, was described 
to illustrate the concept of "the four bridges for bio- 
marker translation": 1) clearly define a specific clinical 
"intended use" (unmet clinical needs); 2) generate suffi- 
cient evidence in preliminary studies to justify the invest- 
ment for a large-scale validation trial; 3) select/develop 
assays with performances suitable for clinical use; and 4) 
conduct a pivotal clinical trial to demonstrate clinical 
utility to obtain regulatory approval and to gain accept- 
ance by the clinical community [31]. Chan proposed a 
roadmap for the translation of proteomics technologies 
into clinical settings. The roadmap requires close collab- 
oration between researchers, industries, regulators, clin- 
ical chemists and clinicians, including private-public 
partnerships, to leverage existing NIH programs such as 
the NCI Clinical Proteomics Tumor Analysis Consortia 
(CPTAC) and Early Detection Research Network 
(EDRN) and the NHLBI, NIDDK, and NIAID clinical 
proteomic programs. Dr. Chan suggested that such a 
joint effort would accelerate the development of clinic- 
ally useful technologies and biomarkers and make a 
significant impact to fulfill the unmet clinical needs for 
patients' personalized health care. 

Barry Dowell, Abbott Laboratories, pointed out key 
challenges in biomarker commercialization: the selection 
of biomarkers to address unmet needs, pre-analytical 
and analytical issues, clinical performance assessment, 
regulatory approval, physician education, launch of new 
products, post-launch studies, and marketing issues. 
These factors are all important to bring a new product 
(biomarker) to market. Key considerations for a success- 
ful product include clinical utility, reagent availability, 
performance characteristics, patent status with freedom 
to operate, and licensing terms. He emphasized pre-ana- 
lytical considerations of specimen collection, biomarker 
stability and its specific forms in blood, as well as bio- 
logical variability. Analytical performance depends on 
establishing assay design requirements, identifying key 
reagent components, and optimizing test procedures and 
manufacturing processes. Clinical performance requires 
establishing specimen collection Standard Operating 
Procedures and study designs with sufficient statistical 
power. FDA submissions should use multi-site studies 
with appropriate patient specimens for the specific 
"intended use". Finally, companies marketing new biomarkers 
face different regulatory and reimbursement 
processes from country to country, competition from 
multiple biomarkers, and the need for clinical studies to 
convince the medical community of the value of the new 
biomarker. 

Darryl Palmer-Toy, Southern California Kaiser Perma- 
nente, presented a clinical laboratory practitioner per- 
spective on proteomic biomarker discovery. Taking a 
biomarker from the research laboratory into the routine 
clinical laboratory requires proactive three-way collabor- 
ation involving the research lab, the diagnostics industry 
and the clinical laboratory [29]. He stated that it's "a jun- 
gle out there" in clinical labs, and "shiny new biomarkers 
quickly lose their luster". He pointed out the importance 
of properly collected patient specimens to obtain correct 
analytical results and correct clinical decisions. An ideal 
assay should be tough enough to stand up to abuse. 
Sometimes, consistent results are more important than 
"true" results (e.g. the hemoglobin A1C test for monitor- 
ing diabetes). Diagnostic tests should provide clinically 
useful information not available by other means and at a 
reasonable cost, including quality control. 

Maria M. Chan, Food and Drug Administration, gave 
the FDA perspective on proteomic biomarker/technology 
translation. FDA regulates In Vitro Diagnostics (IVDs) 
including reagents, instruments and systems using 
human specimens. FDA uses a risk-based classification 
system based on the risk to the patient due to false 
results: class I with low risk is exempted; class II with 
medium risk requires 510(k) pre-market notification; 
class III with high risk requires Pre-Market Approval 
(PMA). "Intended Use" determines the FDA classifica- 
tion, the review path and the type of study required. The 
basis of device review by FDA is the balance between 
safety and effectiveness. All IVDs must have adequate 
analytical and clinical performances and meet labeling 
requirements on intended use, warnings, limitations, 
interpretation of results and performance summary. For 
proteomics IVDs, one should use patient specimens with 
results spanning entire concentration ranges; the per- 
formance at the cut-off value is critical. Precision, limit 
of detection, specificity, matrix effect, accuracy, stability 
and pre-analytic variables are performance criteria. Clin- 
ical validation should include a study design with target 
populations from a minimum of 3 sites, sample size jus- 
tification, patient selection criteria and a pre-specified 
hypothesis. Other pre-study considerations include ap- 
propriate statistical plans, with the training set different 
from the validation set; consideration of possible con- 
founding co-variables; and completion of analytical valid- 
ation preceding clinical validation. She recommended a 
Pre-IDE as a very useful tool for a company to obtain 
free protocol review by FDA and to gain advice on regu- 
latory process and feedback on proposed studies. This 
will prevent unnecessary waste of time and resources. Fi- 
nally, she mentioned new FDA draft guidance for com- 
panion diagnostics (issued 12 July 2011) and for 
research-only or investigation-only (RUO and IUO) pro- 
ducts (issued 1 June 2011), as well as several IVDs 
cleared by the FDA for leukemia, breast, ovarian, pros- 
tate and lung cancers. 

Conclusions from general discussions 

A session within the workshop was fully dedicated to 
discussing potential scientific opportunities and identifying 
compelling ideas for future developments in the field of 
proteomics. Several potential concepts were discussed 
and considered in the context of the presentations by the 
speakers. After evaluation of several important and valu- 
able opportunities, the idea of generating a human prote- 
ome network emerged as the most compelling 
opportunity. However, it was also emphasized that it 
would be important to further support technology devel- 
opment aimed at a more comprehensive characterization 
of the proteome, further develop and support proteomic 
data resources, and work toward an inter-agency road- 
map that would facilitate the translation of proteomic 
discoveries. 

Human proteome networks in health and disease - A major 
scientific opportunity 

The Human Genome Project was initiated almost 25 
years ago. Its findings and its approaches have 
transformed much of biomedical research and clinical 
genetics practice. Connecting genomics knowledge to 
phenotypes is critical for common diseases; research on 
functional genomics and gene/environment interactions 
requires understanding and assays of proteins and protein 
networks. The genome parts list has inspired a corre- 
sponding approach to the identification and 
characterization of protein products; rapid progress recently 
has brought us to antibody evidence and immunohistochemical 
tissue localization of ~12,000 of the 
~20,000 predicted protein gene products in the Human 
Protein Atlas and mass spectrometry-based evidence in 
SwissProt/UniProt for about 13,000 of the ~20,000 protein 
products. Completion of the protein parts list is 
the primary goal of The Human Proteome Project, 
using both chromosome-centric and biology/disease- 
driven approaches [3]. That leaves enormous work to 
be done on protein isoforms, dynamic regulation of 
protein expression, and interactions of proteins with 
macromolecules and small molecules critical to cellular 
and organismal function. 

Projecting ahead 25 years, the participants of this 
Workshop envision that one of the most promising out- 
comes of this modern biomedical research, and especially 
a focus on protein networks, could be the transformation 
of health care into a predictive, preventive, personalized 
and participatory system of care ("P4 medicine") [33]. 
Key pillars of P4 medicine are emerging from omics- 
based research and the field of systems biology. Inter- 
pretation of complete personal genome sequence data 
will require a much better knowledge of how gene pro- 
ducts encoded by the human genome interact with each 
other to contribute to complex molecular interactome 
networks and cellular systems that underlie the biology 
of our tissues and organs. Extensive understanding of the 
functional, dynamic and logical relationships taking place 
in the context of complex interactome networks will 
eventually drive two major aspects of P4 medicine: inte- 
grated biomarker discovery and systems pharmacology. 
Biomarker discovery will improve the predictability of 
specific diseases by integrating personal genomics infor- 
mation, knowledge of environmental components and 
understanding of the properties of cellular systems. Simi- 
larly, safer and more predictably effective personalized 
therapies will emerge from understanding of complex 
relationships between proteins and cellular networks. 

What became increasingly clear throughout this Work- 
shop is that one critical component missing for this vi- 
sion to eventually become reality is a freely-available 
global map of the human proteome in terms of macro- 
molecular interactions between its components, i.e. a wir- 
ing diagram of functional relationships between genes and 
gene products. We see the development of a Human Protein 
Network as a major scientific opportunity. A nearly 
complete map of human protein-protein, protein-nucleic 
acid, and protein-small molecule interactions could be 
generated and this information could be combined with 
biologically-driven findings to complete the human proteome 
parts list and functional networks for the ~20,000 
protein-coding genes and their products. Comparative 
analyses in model organisms would enhance the human 
studies. One of the major outcomes of such interactome 
maps would be a wiring diagram that could be used to 
make sense of complex traits starting from currently available 
Genome-Wide Association Studies (GWAS), at least 
those variants producing non-synonymous mutations in 
protein-coding genes, cancer genome sequencing efforts 
such as The Cancer Genome Atlas (TCGA), and key con- 
sortia such as ENCODE. Interactome network maps 
would serve as foundational information for systems biol- 
ogy, enhance understanding of the pathways between gen- 
otypes and phenotypes, and improve the predictive power 
of integrated and personalized biomarkers, therapies, and 
combinations of therapies. 

Models exist as to how one could organize a global 
large-scale study, inspired by the Human Genome Pro- 
ject and its many follow-on initiatives. In terms of the 
specific needs to address the dynamic aspects of prote- 
ome function, rather than the linear and binary informa- 
tion in DNA sequences, we find inspiration in the 
example of the ENCODE consortium. ENCODE uses 
defined cell lines to generate high-throughput unbiased 
systematic datasets and maps of protein-DNA interac- 
tions, but leaves biological studies and functional follow- 
ups to other granting mechanisms. 

A Network Biology/Interactome Mapping Project can 
be defined in terms of end-point goals and intermediate 
milestones using empirical frameworks. For a compre- 
hensive human binary protein-protein interactome net- 
work, it has been established that on the order of 
~150,000 interactions are to be found in what will constitute 
the "Reference" interactome network [34]. At this 
stage, the combination of low-throughput and high-throughput 
datasets curated by databases such as BioGrid 
(see above) indicates that the community has 
assembled about 20% of that number of high quality 
interactions of the Reference interactome network. In 
other words, approximately 80% of the interactome 
remains to be mapped. 
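
For orientation, the approximate figures above translate into the following counts. This is a back-of-the-envelope Python sketch using the rounded estimates from the text; the variable names are illustrative only.

```python
# Rounded estimates taken from the text; both are approximate.
reference_interactions = 150_000   # estimated size of the Reference interactome [34]
percent_mapped = 20                # share assembled so far, in percent

mapped = reference_interactions * percent_mapped // 100
remaining = reference_interactions - mapped

print(f"mapped so far: about {mapped:,} interactions")
print(f"still to map:  about {remaining:,} interactions")
```

On these estimates, roughly 30,000 high quality interactions have been assembled and roughly 120,000 remain.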

Static maps of macromolecular interactions need to be 
combined with network-based datasets consisting of 
other types of functional and dynamic relationships be- 
tween genes and gene products, such as: i) protein ex- 
pression data by measuring precisely the proteome 
content of particular cell lines and tissues, as demon- 
strated elegantly by Mann during the Workshop, ii) tissue 
and subcellular localization data obtained using both 
mass spectrometry and immunohistochemistry, iii) 
kinase-substrate or other post-translational enzyme-target 
relationships as measured by protein arrays, and then 
iv) translation of the corresponding gene-gene network 
information into a better description of tissue lysates and 
body fluids for integrated biomarker discovery efforts. 

In terms of future directions, it will be crucial to 
organize this map such that it can be used to generate 
dynamic models that can be integrated with both Men- 
delian and complex multifactorial diseases. It will be ex- 
tremely important to democratize access to peptide and 
protein data, interactome networks data, and powerful 
analytical platforms. The Workshop participants agreed 
that it is feasible to generate a human macromolecular 
interaction map with the technology that is available 
today. However, new technology platforms 
for proteomic analysis are needed to better characterize 
the heterogeneity of tissues and tumors. 

In summary, a major obstacle to being able to move 
forward with integrated biomarker discovery and systems 
pharmacology is the lack of a high-quality, freely 
available and nearly complete map of the human interactome 
network. This rapidly emerging scientific challenge 
is also a great scientific opportunity. Such a project 
would benefit from and go beyond the example of the 
ENCODE consortium, particularly in terms of shared 
goals by the participants and shared quality standards to 
map the interactome network. A protein interactome 
networks effort would build on previous and current 
trans-NIH Common Fund investments, including the 
National Centers for Biocomputing, the Technology 
Centers for Networks and Pathways, the Library of Inte- 
grated Network-based Cellular Signatures, the Protein 
Capture Reagents, and the Interdisciplinary Research 
Consortia. If implemented to a sufficiently large extent, 
this interactome mapping project could have a high impact 
by producing a systematic, unbiased, freely available 
wiring diagram for a systems biology-based implementa- 
tion of P4 medicine. 